Embedding the Shortest-Path Metric

Published: Jan 15, 2025
Tagged with: math cs networks

Distance as the crow flies doesn't tell you much if you're looking at a road map. A point across the river may only be 500ft away, but you might need to drive for 20 minutes to find a way over the water. Meanwhile, you can reach a point 500ft down the highway in 5 seconds.

Sadly, straight-line distance on a road map is not proportional to travel time (unless you're looking at a map of a large, empty parking lot).

But wouldn't it be cool if it were? Maybe we can construct such a map — something succinct and easy to read, like Vignelli's classic NYC Subway Map. Maybe it could be some warped version of an ordinary map, or possibly some sort of 3D graph.

Unfortunately, making an accurate travel-time map is impossible. In this post, I'll explain exactly why it's impossible. I'll also show a best approximation of a travel-time map using some real-world data.

Heuristics

This isn't just a silly 'what if' question. Having the map I'm describing would mean we have a perfect heuristic for the A* search algorithm. In contrast, Euclidean distance is an admissible heuristic, which is "a speedy, mostly-accurate way to estimate travel time that never over-estimates it".

Euclidean distance consistently under-estimates actual travel distance.

There are many ways to calculate heuristics (e.g. landmarks), but Euclidean distance is an attractive choice because of its low computational cost. Plus, transit data already has coordinates attached to it, which means no preprocessing is necessary. Because of this, some state-of-the-art routing engines still use A* for its flexibility despite the existence of other attractive options.
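
To make that concrete, here's a minimal sketch of A* with a Euclidean heuristic in NetworkX. The node attributes 'x' and 'y' and the edge-weight key 'w' are assumptions for this sketch (chosen to match the code later in the post), not something your data will necessarily have:

import math
import networkx as nx

def euclidean_heuristic(G):
    # Returns a heuristic function for nx.astar_path. Assumes each node carries
    # 'x' and 'y' attributes expressed in the same units as the edge weights.
    def h(u, v):
        dx = G.nodes[u]['x'] - G.nodes[v]['x']
        dy = G.nodes[u]['y'] - G.nodes[v]['y']
        return math.hypot(dx, dy)
    return h

# Usage (hypothetical graph G with 'x'/'y' node attributes and 'w' edge weights):
# path = nx.astar_path(G, source, target, heuristic=euclidean_heuristic(G), weight='w')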

Note: There is some subtlety here regarding travel time vs. travel distance. For the sake of readability, I'm going to assume these are equivalent for the rest of the article. In practice, it's complicated — especially if you're doing multi-modal routing.

Embeddings

The travel-time map problem lives in the space of networks, confusingly referred to as graphs in pure math. A transportation graph abstracts the nuances of roads into a minimalist schema: a collection of nodes (typically intersections) and edges (road segments) that connect them. Each edge has a weight that represents the time it takes to traverse that road segment.

When we say "make a travel-time map", we mean we want to embed the nodes of the network in a way that preserves travel time as distance. Specifically, we want an embedding where the Euclidean distance between two nodes is proportional to the minimum travel time between them. This is called an isometric embedding (iso=same, metric=length).

I'll illustrate how one might 'manually' make one of these embeddings in 2D:

Now you try! Maybe this goes without saying, but be sure to only consider the shortest paths between any two points. We want the minimum travel time, not the one where we bounce back and forth between nodes a hundred times. Here's a graph; give it a shot!

...sorry, that was a mean example.

Despite its simplicity, the graph above is isometrically unembeddable in 2D space. And 3D space. And any number of dimensions. The center point must simultaneously lie on the midpoints of all three sides, and that is a no-no.

We're Cooked?

I mean, yeah, we were from the start. I've been talking about undirected graphs this whole time. One-way streets exist, so transportation networks are usually asymmetric directed graphs, which are even more impossible to isometrically embed than the example above.

Still, how common are non-embeddable features? Can we isolate them? How do we numerically identify them? Maybe we could still find approximate embeddings?

It's math time. We'll make a few extremely generous assumptions:

  1. We have no one-way roads or turn restrictions.
  2. Distances between distinct nodes are \(> 0\).
  3. Our road network has a single connected component.

Metric Spaces

The "minimum travel time" function is a metric. If we represent the nodes in our network as a finite set \(A={a_1, a_2, \ldots, a_n}\), the shortest path between two nodes \(x, y\in A\) can be written as a function \(d(x, y)\). We can call \(d(x, y)\) a "metric" if it satisfies:

  1. \(d(x, y)\geq0\)
  2. \(d(x, y) = 0\iff x=y\)
  3. \(d(x, z)\leq d(x, y) + d(y, z)\)
  4. \(d(x, y) = d(y, x)\)

If \(d(x, y)\) is just the minimum travel time between nodes \(x\) and \(y\), then conditions 1, 2, and 3 are automatically satisfied. Meanwhile, condition 4 just asks whether the distance matrix \(M_{ij}=d(a_i, a_j)\) is symmetric, which is true if we don't have one-way roads. We now have a discrete metric space.

Note: To actually find the values of \(d(x, y)\), we can use the Floyd-Warshall algorithm to fill in every non-diagonal entry in our distance matrix \(M\). This is obviously impractical at scale, but we're focused on theory for now.
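
As a toy sketch of that step (the graph here is made up, and the edge-weight key 'w' matches the code later in the post):

import networkx as nx

# Build a tiny weighted 'road network' and fill in the full distance matrix.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 1.0), (2, 3, 4.0), (3, 0, 1.5)], weight='w')

M = nx.floyd_warshall_numpy(G, weight='w')  # M[i, j] = d(a_i, a_j)
print(M)
# Because the graph is undirected, M is symmetric (condition 4 above).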

The task of "embedding a discrete metric space" is what motivates distance geometry, which offers a couple equivalent ways to test whether or not something admits an isometric embedding. I don't want to turn this into a textbook, so I'm going to sketch a few quick proofs that should elucidate some important points for us.

A few lemmas about shortest-path isometric embeddings (with proofs)

Symbols I'll use here include:

  • \(P_n\): the linear (path) graph on \(n\) nodes
  • \(C_n\): the cycle graph on \(n\) nodes
  • \(K_{1,n}\): the star graph with one center node and \(n\) leaves
  • \(\mathbf r_i\): the embedded position of node \(a_i\), with \(x_i\) its projection onto a line

Additionally, you should read each of these lemmas with the qualifier "when no shortcuts are present". Obviously, a complete graph contains many cycles, but the shortcuts across those cycles mean that the graph might still be isometrically embeddable.

Lemma 1: Linear graphs isometrically embed as straight lines.

  • The embedding of \(P_1\) is a single point \(\mathbf r_1\) anywhere (trivial).
  • The embedding of \(P_2\) includes an additional point \(\mathbf r_2\) placed \(d(a_1, a_2)\) away from \(\mathbf r_1\).
  • The embedding of \(P_n\) includes an additional point \(\mathbf r_n\), placed \(d(a_n, a_{n-1})\) away from \(\mathbf r_{n-1}\), and \(d(a_n, a_{n-2})\) away from \(\mathbf r_{n-2}\). But by our metric and the structure of the graph, \(d(a_n, a_{n-2})=d(a_n, a_{n-1})+d(a_{n-1},a_{n-2})\), and the three points form a degenerate triangle, meaning \(\mathbf r_n\), \(\mathbf r_{n-1}\), and \(\mathbf r_{n-2}\) are collinear. The collinearity of all points in a path graph follows by induction.

Lemma 2: Isometric embeddings of linear graphs preserve node ordering. Using the same symbols:

  • Let \(x_n\) be the projection of \(\mathbf r_n\) onto the line formed by all points in the embedding, and choose the orientation so that \(x_2 > x_1\). It should be clear that \(d(a_i,a_j)=|x_i - x_j|\).
  • We assume \(x_{n-1} > x_{n-2}\). Our metric requires that \(|x_n - x_{n-2}| = x_{n-1}-x_{n-2} + |x_n - x_{n-1}|\). If \(x_{n}\leq x_{n-1}\), this expression simplifies to either \(x_n = x_{n-1}\) or \(x_{n-1}=x_{n-2}\); the first contradicts our assumption that distinct nodes sit a nonzero distance apart, and the second contradicts \(x_{n-1}>x_{n-2}\). It must therefore be the case that \(x_n>x_{n-1}\). Again, preservation of order follows by induction.

Lemma 3: Cycles with more than 3 nodes are not isometrically embeddable. By contradiction:

  • Assume we can isometrically embed \(C_n\) with the shortest-path metric. \(C_n\) contains \(n\) linear subgraphs of size \(n-1\), all equivalent to \(P_{n-1}\). We will index them with superscripts: \(\{P_{n-1}^1, P_{n-1}^2,\cdots,P_{n-1}^n\}\).
  • For \(n>3\), these linear subgraphs have overlapping segments, so they all must be collinear. Consequently, all points in the embedding of \(C_n\) must be collinear, and they share a common projection \(\{x_1, x_2, \cdots, x_n\}\).
  • The first node of \(P_{n-1}^1\) is embedded at \(\mathbf r_1\), and its last node is embedded at \(\mathbf r_{n-1}\). From the previous lemma, node order is preserved, so (with our chosen orientation) \(x_1 < x_3\). But applying the same argument to the projection of \(P^3_{n-1}\) gives \(x_3 < x_1\), and we have a contradiction.

Lemma 4: Star graphs with more than 3 nodes are not isometrically embeddable. Also by contradiction:

  • The \((n+1)\)-node star graph \(K_{1,n}\) contains \(n(n-1)/2\) distinct copies of the three-node linear graph \(P_3\).
  • These linear graphs form a connected network of overlapping edges, so they must all be collinear.
  • Like before, we can 'hop around' these colinear subgraphs and display a contradiction. I'll leave this one as an exercise (hint: odd numbers).

Lemma 5: Denser topologies are isometrically embeddable under some very specific conditions. Uh-oh, we've lost the ability to make concise arguments based on structure. I think we might need some heavier machinery. Strap in:

Gram Matrices

If we have an isometric embedding of \(n\) points \(\phi:A\rightarrow\mathbb R^{n-1}\) where \(\phi(a_i)=\mathbf{r}_i\), we can construct a matrix \(G\) of all possible dot products between the embedded points, \(g_{ij}=\mathbf{r}_i\cdot\mathbf{r}_j\). This matrix is called a Gram matrix.

Gram matrices have a very nice property that will help us later, so I'd like to set up a correspondence (an invertible map) between them and the Euclidean distance matrices we've been working with. The reason we can't do this right away is that translation preserves distances but not dot products: moving a set of points two meters to the right changes their dot products, but not the distances between them.

To remedy this for now, we'll fix \(\phi(a_1)=\mathbf{r}_1=\mathbf 0\) (so \(||\mathbf{r}_i||=||\mathbf{r}_i-\mathbf{r}_1||\)), which will make the first row and column of our Gram matrix zero. We'll work with the nonzero \((n-1)\times (n-1)\) submatrix that remains after deleting that row and column from here on. Keep in mind that this new matrix still contains the 'distance from zero' information along its diagonal.

The correspondence between Gram matrices and distance matrices is given by the polarization identity: \[ G_{ij}=\mathbf{r}_i\cdot\mathbf{r}_j=\frac{1}{2}\left(||\mathbf{r}_i||^2+||\mathbf{r}_j||^2-||\mathbf{r}_i-\mathbf{r}_j||^2\right) \]

This is an easily derivable property of any inner product space. It is also invertible: \[||\mathbf{r}_i-\mathbf{r}_j||=\sqrt{G_{ii}+G_{jj}-2G_{ij}}\]
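
As a quick sanity check (a throwaway sketch, not part of any pipeline), here's the identity and its inverse verified numerically for a handful of random points, with \(\mathbf r_1\) pinned to the origin:

import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=(5, 3))  # five points in R^3
r -= r[0]                    # fix r_1 = 0, as described above

i, j = 2, 4
# Polarization identity: distances -> dot product
G_ij = 0.5 * (r[i] @ r[i] + r[j] @ r[j] - (r[i] - r[j]) @ (r[i] - r[j]))
assert np.isclose(G_ij, r[i] @ r[j])

# And back: dot products -> distance
G_ii, G_jj = r[i] @ r[i], r[j] @ r[j]
assert np.isclose(np.sqrt(G_ii + G_jj - 2 * G_ij), np.linalg.norm(r[i] - r[j]))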

This gives us the machinery to freely flicker back and forth between Gram matrices and distance matrices, but we still don't have a way to check if a matrix is a valid distance matrix.

Thankfully, there's not much left to do. Gram matrices explicitly look like the outer product of a list of vectors with itself: \[ G= \begin{bmatrix} (\mathbf{r}_2\cdot\mathbf{r}_2) & (\mathbf{r}_2\cdot\mathbf{r}_3) & \cdots & (\mathbf{r}_2\cdot\mathbf{r}_n) \\ (\mathbf{r}_3\cdot\mathbf{r}_2) & (\mathbf{r}_3\cdot\mathbf{r}_3) & \cdots & (\mathbf{r}_3\cdot\mathbf{r}_n) \\ \vdots & \vdots & \ddots & \vdots \\ (\mathbf{r}_n\cdot\mathbf{r}_2) & (\mathbf{r}_n\cdot\mathbf{r}_3) & \cdots & (\mathbf{r}_n\cdot\mathbf{r}_n) \end{bmatrix} =\begin{bmatrix}\mathbf{r}_2\\\mathbf{r}_3\\\vdots\\\mathbf{r}_n\end{bmatrix}\begin{bmatrix}\mathbf{r}_2&\mathbf{r}_3&\cdots&\mathbf{r}_n\end{bmatrix}=R^TR \]

Conveniently, any Gram matrix can be written as the product \(G=R^TR\), where \(R\) is the matrix whose columns are the embedding vectors. This is equivalent to saying that \(G\) must be positive semidefinite, and it turns out that this is a necessary and sufficient condition for an embedding in \(\mathbb R^{n-1}\) to exist, assuming we don't have any sort of degeneracy.

So, in sum, to test if a distance matrix is embeddable:

  1. Using the polarization identity, find dot products for all pairs of elements of \(A\) (but you'll need to choose a 'zero' somehow).
  2. Check whether the resulting Gram matrix is positive semidefinite.
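
Here's a sketch of that test. The function name is mine, and it simply picks the first node as the 'zero' (step 1's caveat) rather than doing anything clever:

import numpy as np

def is_embeddable(M, tol=1e-9):
    # Step 1: polarization identity with a_1 as the origin,
    # so ||r_i|| = M[0, i] for the remaining nodes.
    d0 = M[0, 1:]
    D = M[1:, 1:]
    G = 0.5 * (d0[:, None]**2 + d0[None, :]**2 - D**2)
    # Step 2: positive semidefinite <=> no (meaningfully) negative eigenvalues.
    return np.linalg.eigvalsh(G).min() >= -tol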

However, keep in mind that we've been working with general distance matrices. Shortest-path matrices are far rarer and impose a complicated requirement on the distance matrix: let \(j_1, j_2, \cdots, j_n\) represent any sequence of matrix row/column indices. Then the following inequality must hold: \[ M_{ik} \leq M_{ij_1} + M_{j_n k} +\sum_{m=1}^{n-1} M_{j_mj_{m+1}} \] This condition is not so easy to manipulate with the standard linear algebra toolbox, so I doubt that there's an easy way to form valid shortest-path matrices from arbitrary Gram matrices. But now, at least, we have a general way to test whether or not a distance matrix is embeddable: the positive-semidefiniteness of the Gram matrix.

Let's dig a little deeper into this requirement:

Decomposition

A matrix \(M\) is positive semidefinite if and only if: \[ \mathbf x \cdot (M\mathbf x) \geq 0 \] for all vectors \(\mathbf x\). This is usually written as \(\mathbf x^T M \mathbf x \geq 0\), but I think using the dot product makes the meaning clearer. If the dot product of two vectors is \(\geq 0\), it means they don't point in opposing directions, so \(M\) is positive semidefinite if and only if it never rotates a vector by more than \(90°\). What kinds of matrices end up flipping vectors? Answer: matrices with negative eigenvalues. A single negative eigenvalue rules out positive semidefiniteness; a symmetric matrix is positive semidefinite exactly when all of its eigenvalues are nonnegative.

This means that to test whether a network is embeddable, we need to find a spectral decomposition of its Gram matrix \(G\). Because \(G\) is symmetric, it can be decomposed with an orthogonal matrix of eigenvectors, \(Q\): \[ G = Q\Lambda Q^T \] Where \(Q\) holds \(G\)'s eigenvectors in its columns, and \(\Lambda\) is a diagonal matrix of \(G\)'s eigenvalues (in the same order). If those eigenvalues are nonnegative, \(\Lambda\) has a matrix square root \(\Lambda^{1/2}\Lambda^{1/2}=\Lambda\), which is simply \({\Lambda^{1/2}}_{ij}=\sqrt{\Lambda_{ij}}\). Because \(\Lambda^{1/2}\) is also diagonal, we can write: \[ G=Q\Lambda^{1/2}(Q\Lambda^{1/2})^T \] Which means that if \(G=R^TR\), then \((Q\Lambda^{1/2})^T\) is a valid choice for \(R\): the rows of \(Q\Lambda^{1/2}\) are embedded position vectors. And remember, \(R\) was a matrix of embedded position vectors, so we've found a valid embedding!
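
A small sanity check of that recipe (toy data, nothing from this post's dataset): start from known points, build the Gram matrix, recover positions from the spectral decomposition, and confirm that the pairwise distances survive.

import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(6, 2))
points -= points[0]                 # fix r_1 = 0, as before
G = points[1:] @ points[1:].T       # Gram matrix of the remaining points

evals, Q = np.linalg.eigh(G)        # symmetric, so eigh
R_rows = Q @ np.diag(np.sqrt(np.maximum(evals, 0)))  # rows are recovered positions

# The recovered points may be rotated or reflected, but pairwise distances match.
original = np.linalg.norm(points[1:, None] - points[1:], axis=-1)
recovered = np.linalg.norm(R_rows[:, None] - R_rows, axis=-1)
assert np.allclose(original, recovered)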

Multidimensional Scaling

The steps described above are almost a description of classical (a.k.a. Torgerson) multidimensional scaling (MDS), an excellent tool in the dimensionality reduction shed. The only difference is that in MDS, only the largest eigenvalues/vectors are used, giving us a lower-dimensional embedding.

Classical MDS is a concise way to say "principal component analysis (PCA) on the Gram matrix formed from the distance matrix". Like PCA, it minimizes a squared-error loss; strictly speaking, it minimizes the error in the reconstructed inner products (the 'strain') rather than \(\sum_{i<j}(||\mathbf r_i - \mathbf r_j|| - d(a_i, a_j))^2\) directly (Torgerson's original 1952 paper is the classic reference). For an embeddable Euclidean distance matrix, both errors are zero, but MDS works with any metric, including our shortest-path metric.

Let's give it a shot.

Instead of 'centering' our embedding on \(\mathbf r_1\), it's common practice to mean-center the embedding so that \(\sum_i \mathbf r_i=\mathbf 0\). This centering provides an elegant way to calculate Gram matrices without diving into indices. Given an \(n\times n\) Euclidean distance matrix \(M\), its associated Gram matrix \(G\) is given by: \[ G=-\frac{1}{2}CM^{(2)}C \] Where \((M^{(2)})_{ij}={M_{ij}}^2\) and \(C=I-\frac{1}{n}\mathbf 1 \mathbf 1^T\) is a centering matrix (\(\mathbf 1 \mathbf 1^T\) just gives us a matrix filled with ones). With NumPy, it looks like this:

import numpy as np
import networkx as nx

def distance_to_gram(M):
    n = M.shape[0]
    C = np.eye(n) - np.ones((n, n))/n
    return -0.5 * C @ np.square(M) @ C # This is called "double-centering"

def shortest_path_MDS(network, ndims=2):
    M = nx.floyd_warshall_numpy(network, weight='w') # Performance bottleneck!
    G = distance_to_gram(M)
    evals, evecs = np.linalg.eigh(G) # G is symmetric, so eigh keeps everything real

    # Only take the [ndims] largest eigenpairs
    idx = np.argsort(evals)[::-1]
    evals = evals[idx]
    evecs = evecs[:, idx]

    # Clamp small negative eigenvalues (non-embeddable directions) to zero
    # so the square root stays real
    positive_evals = np.maximum(evals[:ndims], 0)
    sqrt_evals = np.sqrt(positive_evals)
    embedding = evecs[:, :ndims] @ np.diag(sqrt_evals)
    return embedding
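
Here's a minimal usage sketch (reusing the imports and shortest_path_MDS from above; the matplotlib plotting is just my guess at a quick way to eyeball the result):

import matplotlib.pyplot as plt

G = nx.cycle_graph(8)
nx.set_edge_attributes(G, 1.0, 'w')     # unit travel times on every edge

points = shortest_path_MDS(G, ndims=2)
plt.scatter(points[:, 0], points[:, 1])
plt.gca().set_aspect('equal')
plt.show()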

MDS Results

The output of MDS on small graphs seems reasonable:

In general, the embeddings are not so different from force-directed layouts. However, the graphs we're viewing here are far from anything we'd see in the real world.

Let's plug in some actual data. I'll be looking at a subgraph of the network I constructed in an earlier post, specifically, this region of Philadelphia:

The \(O(V^3)\) Floyd-Warshall step makes this pretty slow (at least in NetworkX), but the result is interesting:

If you're thinking that this looks suboptimal and you could do a better job, remember two things:

  1. This embedding minimizes error against the shortest-path metric. It doesn't care about visual clarity or usability like we do.
  2. Classical MDS minimizes a squared-error loss, and in that way, it's like an average (\(\overline x := \text{arg min}_y \sum_i (x_i - y)^2\)). For heavily skewed distributions, the average isn't always representative of what we care about.

Note: if you're wondering how sklearn's non-classical MDS appears in comparison, the answer is 'bendier', but pretty much the same otherwise. I'd show you how it looks on the network above, but it's *very slow to run.*

Actually, there's something interesting happening in this embedding. Take a look at the 3D version:


Notice how it 'wiggles' around. This is indicative of something deeper; check out this 3D MDS embedding of a 2D lattice with unit edge weights:

See how it forms a saddle? Saddles are a prototypical example of a surface with negative Gaussian curvature. Similarly, the 'wiggle' in the previous example is reminiscent of an embedding of the hyperbolic plane (a surface with constant negative curvature) into real space:

Courtesy of Daina Taimina

It seems like the trouble with making these shortest-path embeddings is that they run out of room; the 3-star graph I showed earlier would be easy to embed, if only triangles worked differently. But in hyperbolic space, triangles do work differently. This leads us to the topic of hyperbolic embeddings, which don't bother with \(\mathbb R^n\) and instead embed in hyperbolic space.

I think this article is long enough, so I'll drop a few papers related to shortest-path hyperbolic embeddings:

And call it for now.

The Gist

I think that the conclusion you can pull from this is that any attempt to improve a Euclidean distance heuristic by 'moving nodes around' or 'adding dimensions' will ultimately be limited by some very fundamental math. I'm not saying that doing this won't offer performance gains (follow-up post topic?), but any technique using an embedding in \(\mathbb R^n\) will eventually hit some pretty hard walls.

Finally, if you liked any of the math in this, I recommend Deza and Laurent's Geometry of Cuts and Metrics (1996), specifically Part III.
